1 Introduction
In this introduction, we first show how to install R and RStudio. The chapter then gives an overview of the most important panes in the RStudio IDE. In the third section, you’ll learn how to start a new project and how to organize the files in that project.
R is a programming language: it tells the computer what to do. Like any other language, it has a vocabulary and grammar rules. The vocabulary includes, for example, the names of functions. For instance, if you want to create a vector (1, 2, 3) and call it my_first_vector, you would write my_first_vector <- c(1, 2, 3). Here, c is part of the vocabulary: it stands for combine or concatenate and combines 1, 2 and 3 into a vector. The items that you want to collect in the vector are placed between (). If you used [] instead, R wouldn’t know what to do and would show an error. The assignment operator <- assigns the name my_first_vector to that vector. To separate the values 1, 2 and 3, you use a comma and not, for instance, a space or a semicolon. There is also a style guide. In the example, the style includes a space after a comma but not before it, a space before and after the assignment operator, and no space after the opening ( or before the closing ). The style guide is only a guide: R does not enforce it. In other words, you can also write my_first_vector<-c( 1,2,3 ).
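The grammar and style points above can be tried out directly. The lines below are a minimal sketch; the names same_vector and bracket_attempt are hypothetical and only serve this illustration.

```r
# Style-guide version: spaces around <- and after each comma.
my_first_vector <- c(1, 2, 3)

# The same statement without the recommended spacing; R accepts it as well.
same_vector<-c( 1,2,3 )

# Both statements create exactly the same vector.
identical(my_first_vector, same_vector)

# Using [] instead of () breaks a grammar rule and raises an error.
# try() reports the failure without stopping the rest of the script.
bracket_attempt <- try(c[1, 2, 3], silent = TRUE)
inherits(bracket_attempt, "try-error")
```

Style differences never change the result; using the wrong brackets does.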
RStudio is an Integrated Development Environment or IDE. Technically, you don’t need it to write R code. As a matter of fact, you can write R code in any text editor, save the code as an .R file and run the code from R’s console. In other words, you can use R without RStudio. However, you cannot run RStudio without a programming language such as R. There are also other IDEs that you can use. We use RStudio because it is widely used. An IDE helps you to work with R. After you have installed R and RStudio in the next section, you’ll see that RStudio shows 4 panes: one pane to write scripts, one pane that shows all your variables, one pane where every command is immediately executed, and one pane where you can see your files and plots and where you can find the help function. In addition, RStudio suggests completions for functions or variables, adds the closing bracket ) if you type the opening one (, and changes the color of some words you use (e.g. if or for). If you launch R without RStudio, you won’t see these panes: you enter a command in R’s console and R executes that command.
1.1 Installing R and RStudio
There are two ways to use RStudio. The first is to install a local copy on your machine. R and RStudio are available for Windows, Mac, Linux, … . If you cannot find a copy for your device (e.g. iPad), you can access RStudio via the cloud. We’ll first focus on the local installation. Posit Cloud is next.
1.1.1 Local installation
You can download R and RStudio from the Posit website. First, you need to install R, then you can install RStudio. Doing it the other way around won’t work: RStudio needs an R installation to work. If you click the download and install R button (Figure 1.1), you will be redirected to the Comprehensive R Archive Network or CRAN. CRAN includes all R versions and packages. For this course, you can install the most recent R version: 4.5.2. Although there wouldn’t be too many issues if you updated to a newer version when it becomes available, keep this version for the remainder of the course. Doing so prevents changes in R from altering some of the functions that we discuss during the lectures.
As shown in Figure 1.2, you can download R for Linux, MacOS and Windows. If you use Linux, first check if your distribution doesn’t already include R. If that is not the case, you can select your distribution and download the installation file. For MacOS, the file you need to download depends on your system (i.e. modern Apple Silicon or older Intel Macs). For Windows, you can download and install the base version.
Installing R is usually quite straightforward. You can accept the standard configuration settings. All you need to do is select the directory where you want to install R.
If the installation of R is finished, you can install RStudio. On the Posit website, you can now click on the Install RStudio button (see Figure 1.3). Here, this button refers to my operating system (Windows).
If you are on a Mac or Linux device and if that button does not show your operating system, you will find it at the bottom of that page (Figure 1.4).
Installing RStudio is usually very straightforward. You can accept the standard configuration settings. All you need to do is select the directory where you want to install RStudio. We’ll use Version 2026.01.0+392. As with R, you might want to keep that version fixed for the semester.
1.1.2 Posit Cloud
If you cannot install RStudio, e.g. because there is no version available for your system, you can run RStudio via Posit Cloud. RStudio will behave a little differently from a local installation, but the experience will essentially be the same.
There is a free plan as well as a student plan ($5 per month). You can see a summary of these plans in Figure 1.5.
The only difference is the amount of compute time. For the free plan, you have 25 hours per month. This increases to 75 hours in the student plan. Given the settings of the plan, compute time is very close to the actual time that you use RStudio to do calculations. Note that this will be less than the number of hours that you use R. Writing code, for instance, does not require processor capacity and will not add to your compute time. You can always switch plans if you feel the free plan offers you too few compute hours. You will not need the other (more expensive) options. These usually include more computing power (e.g. 8 CPUs) and are a good option if you work with very large datasets or run large and processor-intensive code. This is not what we will do here. To use Posit Cloud, you need to register. Once registered, you will be able to log in. The opening screen looks like the one in Figure 1.6.
1.2 RStudio
1.2.1 A tour of RStudio
If you installed a local copy, you should have two new icons on your desktop: the R logo and the RStudio logo. We’ll always use RStudio. If you open RStudio, you’ll see a screen like the one shown in Figure 1.7 (other than the version of R, the opening screen will be very similar).
From File in the top left corner of the menu, select New file and R Script as shown in Figure 1.8. This opens a new window.
You can also use the keyboard shortcut Ctrl + Shift + N (Windows) or Command + Shift + N (Mac).
In Posit Cloud, you first need to set up a new project. To do so, you click on New projects in the top right of Figure 1.5 and select New RStudio project. You will then see a screen similar to Figure 1.7. From the File menu, select New file and R Script as shown in Figure 1.8. This opens a window similar to the one in the desktop version.
As shown in Figure 1.9, the RStudio window has 4 panes:
Console pane
Source pane (or editor): allows you to edit code and to view data
Environment pane (with tabs such as Environment, History, … )
Output pane (with tabs such as Files, Plots, Help, Packages, … )
If you want, you can modify the tabs in each window. From Tools, select Global options and Pane layout on the right hand side. However, for now, you can leave the panes as they are.
Let’s start at the bottom left of RStudio: the console. The console allows you to execute code interactively. Every line you type in the console is immediately executed and the result is shown. For instance, if you type 2 + 2 and press Enter, the console will show [1] 4. If you then type 4 + 4, R will immediately show the output [1] 8.
Now let’s do the same in the editor in the top left: write 3 + 3 in that pane and press Enter. As you can see, nothing happens: the result is shown neither in the editor nor in the console. After Enter, the cursor moved to the next line, but RStudio didn’t return any result. Add a second line of code in the editor, 4 + 4, and press Enter. Again, other than the cursor moving to the next line, nothing happens. Now select both lines and press the small button in the top right corner of the editor that says Run with a green arrow (Figure 1.10).
If you look at the console now, you’ll see the result for both lines of code:
> 3 + 3
[1] 6
> 4 + 4
[1] 8
The console shows the lines that you ran (3 + 3 and 4 + 4) and after each line the result [1] 6 and [1] 8.
In other words, the editor waits to execute the code until you explicitly ask RStudio to run the code. R executes the code line by line and shows the output in the console.
To run the code in the previous example, you selected the lines you wanted to run and then pressed Run. There are several useful keyboard shortcuts to run part or all of the code that you write in the editor:
| Description | Windows | Mac |
| --- | --- | --- |
| Run current line/selection | Ctrl+Enter | Command+Return |
| Run current line/selection (retain cursor position) | Alt+Enter | Option+Return |
| Run from document beginning to current line | Ctrl+Alt+B | Option+Command+B |
| Run from current line to document end | Ctrl+Alt+E | Option+Command+E |
To illustrate, write 5 + 5 in the editor on a new line after the two lines 3 + 3 and 4 + 4 and keep the cursor at the end of this line:
- Ctrl + Enter (Command + Return) will run the line 5 + 5 and show the result in the console. The cursor will move to the next line.
- Alt + Enter (Option + Return) will run the line 5 + 5 and show the result in the console. The cursor will stay at the end of the line.
- Ctrl + Alt + B (Option + Command + B) will run all lines 3 + 3, 4 + 4 and 5 + 5 and show the result for each line in the console. The cursor stays at the end of the line. This option executes all code before the cursor: it runs the code from the beginning (the B in Ctrl + Alt + B) to the position of the cursor.
Now move your cursor to the start of the first line (before the first 3 in 3 + 3):
- Ctrl + Alt + E (Option + Command + E) will run all lines 3 + 3, 4 + 4 and 5 + 5 and show the result for each line in the console. The cursor stays at the beginning of the first line. This option executes all code after the cursor: it runs the code from the position of the cursor to the end (the E in Ctrl + Alt + E).
You can save the code that you wrote in the editor as an R script via File and Save as. RStudio will then save your code in a your_file_name.R file. The .R extension shows that this file includes R code. If you want to continue with your script, you follow File and Open to select the script you want to open.
Note that RStudio doesn’t autosave your work. If you accidentally try to close a file in the editor, RStudio will warn that there are unsaved parts in that file, but it does not save your work unless you explicitly ask it to. To save a file, you can use Ctrl + S (Command + S) or click on the small blue disk in the top left corner.
In the editor, you can have multiple scripts open at the same time. Their name (or Untitled) will be shown below RStudio’s menu.
In RStudio, you will usually work in the editor. Recall from Chapter 1 that a typical data science project includes various steps: importing data, tidying data, data transformation, running a model, preparing a data visualization and output. These steps often require various lines of code. Using the console for that purpose would be very inefficient. First, you will make mistakes. If you do so in the middle of a long series of commands in the console, you’ll need to restart from the first line. In an R script, you can correct that mistake and continue. Second, you can reuse a script. This is not possible if you use the console. The ability to save and rerun code is crucial. Suppose that your job requires you to present monthly sales data in a PowerPoint presentation. If you include all steps in a script, you can reuse that script every month. Your first presentation will take some time to code, but the effort to prepare all other presentations will be limited: open the script, change a couple of lines (e.g. the name of an Excel file) and hit “run”.
There are two panes left: the environment pane and the files pane. To illustrate what the environment pane does, use the editor and type and run the following line:
a <- 100
Don’t worry too much about the exact meaning of <- here; we will cover that shortly. In short, this line assigns the value 100 to a variable called a.
Note that two things happen as you run a <- 100. First, as expected, you see the result in the console:
> a <- 100
Second, in the environment pane in the top right, you see that variable a was created and that the value assigned to that variable is 100. The value can now be used in any part of your R session. If you type and run a in the editor:
a
The console shows the value of a:
[1] 100
You can now use a in your code. For instance, if you type and run the next line in the editor, the console will show the result:
a + 10
[1] 110
Now add the following line in the editor:
df <- mtcars
mtcars is one of the many datasets that are included with R. Here we copy that dataset and assign that copy to the object df (which stands for data frame, a data structure we’ll cover in e.g. Chapter 2). After the assignment, df and mtcars hold the same data. However, if we change df, that will not affect mtcars.
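That a change to df leaves mtcars untouched can be checked with a small sketch. Setting the first mpg value to 0 is only an illustration and not part of the course example.

```r
# Copy the built-in mtcars dataset to a new object.
df <- mtcars

# Change one value in the copy only: the mpg of the first car.
df$mpg[1] <- 0

# The copy changed ...
df$mpg[1]

# ... but the original dataset still holds its own value.
mtcars$mpg[1]
```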
The environment pane now shows 2 parts: Data and Values. The Data part shows the datasets that you have opened. The environment pane shows that this dataset includes 32 observations for 11 variables. If you click on the blue arrow, you see the variables in the dataset (Figure 1.11).
The environment pane now lists all variables included in the dataset: the df dataset (a copy of mtcars) includes mpg, cyl, disp, hp, … . You can also see that all these variables are numerical (num). The environment pane also shows the first values (observations) of each variable. In the Values part, you can see that we still have a.
If you need to take a closer look at the data you can use View() with the name of the object you would like to inspect between ( ). Note that R is case sensitive: View() is not equal to view()! If you run the following command,
View(mtcars)
you’ll see the dataset in the source pane (Figure 1.12).
The output shows all the observations (32) and all variables (11). The names of the rows refer to the observations; the names of the columns are the variable names. If you move your cursor to the variable names, you’ll see additional information, e.g. for the variable mpg, you will see Column 1: numeric with range 10-35: mpg is a numeric variable with a minimum value of 10 and a maximum value of 35. You can sort a dataset by one variable using the two triangles next to the variable name. To undo the sort, you can click on the triangle in the column with the row names.
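The information that View() and the environment pane show can also be requested with code. The functions below are base R functions that this chapter has not introduced, so treat the lines as a sketch; sorted_by_mpg is a hypothetical name.

```r
# Number of rows (observations) and columns (variables).
dim(mtcars)

# Type and first values of each variable, as in the environment pane.
str(mtcars)

# The first three observations.
head(mtcars, 3)

# Smallest and largest value of mpg (View() shows these rounded).
range(mtcars$mpg)

# Sort the dataset by mpg, like clicking the triangles in View().
sorted_by_mpg <- mtcars[order(mtcars$mpg), ]
head(sorted_by_mpg, 3)
```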
All data and values that are in the environment pane can be used in an R script. In other words, the environment pane shows you which datasets were imported and which variables were created during the active session. These datasets and variables can be used throughout the session. Even if you save and close the script in the editor and start a new one via File, New file, R Script, you will still be able to access these datasets and variables.
To remove a dataset or variable from the session, you can use the remove function rm(). For instance, if you want to remove a from the session, you can use:
rm(a)
As you removed the variable a, it is no longer shown in the environment pane. If you use a in a calculation, e.g. a + 2, the console will show the error message
Error: object 'a' not found
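The same steps can be followed in code. In this sketch, ls() lists the names of the objects in the environment and tryCatch() captures the error message instead of stopping; both are base R functions that the chapter has not covered yet, and message_text is a hypothetical name.

```r
# Create a variable and check that it is part of the session.
a <- 100
"a" %in% ls()

# Remove it again; it disappears from the environment.
rm(a)
"a" %in% ls()

# Using a now fails. tryCatch() stores the error message as text.
message_text <- tryCatch(a + 2, error = function(e) conditionMessage(e))
message_text
```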
The history tab in the environment pane allows you to see all code that you ran during the active session.
The fourth pane, at the bottom right, shows the directory and the files in that directory. In Figure 1.11 this is a rather long directory referring to the syllabus for this course. In that folder, I have 3 sub-directories: Data-and-programming-skills, docs and Lectures. Under the Files tab, you can create a new folder, delete existing folders, rename folders, … .
If you create a plot, that plot will be shown in the Plots tab. To see how the Plots tab works, run the following lines in the editor (don’t worry yet if you don’t understand the code; the code is written to show where you can see a plot, not to teach you how to create a plot):
plot(hp ~ mpg,
     data = mtcars,
     main = "Horse power and fuel economy",
     xlab = "Miles per gallon",
     ylab = "Horse power")
The plot, which is shown in the output here, also appears in the Plots tab. If you haven’t done so as part of your code, you can export the plot to a pdf, jpeg or png file using the Export button in the Plots tab. When we introduce the package {ggplot2}, you will learn how to add titles and subtitles to the plot, change the colors of the plot, add titles to the axes, change the legend, save these plots in your code or add them to a PowerPoint presentation.
The Help tab allows you to search in the R documentation. If you need help on e.g. the function mean, you can type help(mean) or ?mean in the console or you can use the search bar in the Help tab.
1.2.2 Packages
1.2.2.1 Installing packages
R includes a lot of functions out of the box. We will refer to these functions as “base R” functions. However, it is almost impossible to include all possible functions that a data scientist might need in one application. Statistical procedures that you find in one field (e.g. marketing, accounting or finance) are often not used in other fields (biology, engineering or medicine). This is why R is extendable through packages.
Packages are an essential R tool as they include functions for specific applications. These functions allow you to perform a wide range of tasks, e.g. importing data, tidying data, transforming data, visualizing data, statistical modelling, … . In that way, you don’t have to write all the code to perform those tasks on your own. You can find all these packages on CRAN’s list of available packages by name. If you look at this list, you’ll notice that the number of packages is huge. However, for most applications, the number of packages that you will use is limited. For that reason, R and RStudio don’t install all these packages out of the box: you install them when you need them. To install a package, you can use the Tools menu. The first option, Install packages, allows you to select the packages that you need. As an alternative, you can use the command install.packages() in the editor or console. Within the ( ) you include the names of the packages you want to install between " ".
We will use a number of packages. For now, we will install the packages that we will use often such as {tidyverse}, {here}, {nycflights13} and {nycflights23}. To install these packages, you can enter the following line in the console, run it from the editor or select the packages in the Tools, Install packages menu in RStudio.
install.packages(c("tidyverse",
                   "readxl",
                   "glue",
                   "here",
                   "janitor",
                   "gt",
                   "officer",
                   "viridis",
                   "rnaturalearth",
                   "gganimate",
                   "sf",
                   "nycflights13",
                   "nycflights23",
                   "magrittr"))
After installation, these packages are available in all your subsequent sessions. In other words, you don’t need to install them every time you use R or RStudio. The packages are updated from time to time, so normally it is sufficient to check for updates and install those when they are available. For this course, however, keep all packages fixed until you complete the course, as with R and RStudio.
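Because packages stay installed between sessions, you can check which of them are still missing before you install anything. The sketch below is one possible way to do that; missing_packages() is a hypothetical helper, not a function from R or from one of the packages, and the actual install call is left as a comment so that nothing is installed by accident.

```r
# Packages used in this course (same list as in the install.packages() call).
course_packages <- c("tidyverse", "readxl", "glue", "here", "janitor", "gt",
                     "officer", "viridis", "rnaturalearth", "gganimate", "sf",
                     "nycflights13", "nycflights23", "magrittr")

# Hypothetical helper: returns the packages that are not installed yet.
# installed.packages() lists every package R can find on this computer.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(course_packages)
to_install

# Remove the # on the next line to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```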
The package {tidyverse} installs a number of packages that help you to
- work with special types of data,
- or add programming tools: purrr, Wickham and Henry (2023).
The other packages help you to
- choose color scales in plots: viridis, Garnier et al. (2024),
- work with Word and PowerPoint: officer, Gohel, Moog, and Heckmann (2024), or
- use maps in your data visualisation: sf, Pebesma (2018) and rnaturalearth, Massicotte and South (2024).
The {here} package allows you to find and save files. It is especially useful in a project-based workflow. We’ll use this package often to import, export and save files.
The two nycflights packages, nycflights13 and nycflights23, will be used to illustrate code. These datasets include all flights that departed from New York City airports in 2013 and 2023 to airports in the United States, Puerto Rico and the American Virgin Islands. The data include information on the flights (departure time, arrival time, delays), weather, construction information for each plane, the airport names and locations, and the carrier names.
1.2.2.2 Using packages
The functions defined in the packages are not immediately available after installation. There are two ways to use them. First, you can load the package into your computer’s memory using the library() function. For instance, if your work requires you to produce plots, you can load {ggplot2} using library(ggplot2). After that, you can use all its functions in your code without reference to the package.
A second way to use a function from a package is to call it explicitly using packagename::function() in your code. This is especially useful if you only need a limited number of functions from a package and you don’t use them often. The {here} package, for instance, makes opening and saving files easier. As we’ll see in the next section, it also allows you to share code with others. As you don’t import, read or save datasets, scripts or plots often in your code, it is not necessary to load that package into the memory of your computer. In that case, it is more efficient to write here::here() to use the here() function from the {here} package.
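Both ways of calling a function can be tried without installing anything. The sketch below uses {tools} as a stand-in: this package is not used elsewhere in this chapter, but it is installed together with R and is not loaded when R starts. Its file_ext() function returns the extension of a file name.

```r
# Option 1: call one function explicitly with packagename::function().
# The package itself is not loaded with library().
tools::file_ext("sales.csv")

# Option 2: load the package first, then call the function by name.
library(tools)
file_ext("my_first_script.R")
```

The first call returns "csv", the second "R". Both styles give the same result; the choice depends on how often you need functions from the package.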
Let’s load the {tidyverse} package. You can copy and run the following code:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.1 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.2
✔ purrr 1.2.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In the console, you can see that this command produces the following output (Figure 1.13):
The first part shows all packages that are loaded as part of the tidyverse package. You can now use all the functions that are defined in each of these packages. You can also see the version (e.g. ggplot2 4.0.1). Second, there are a couple of conflicts. These conflicts occur because some packages include functions with the same name as functions in base R or in other packages. In this case, both {dplyr} and the {stats} package, which is part of base R, include the functions filter() and lag(). This message tells you that if you use filter(), you’ll be using the {dplyr} version, not the base R one. In case you need to use base R’s filter function, you will have to instruct R to do so explicitly and use stats::filter(). The same holds for the lag() function. Because you often want to use the package function and not the one from base R (why would you load a package if you don’t want to use its functions?), this is usually not a problem. However, if loading a package causes a conflict message, it is always advisable to read it carefully.
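The package prefix removes any doubt about which filter() is used. The sketch below calls both versions explicitly; the names moving_average and efficient_cars are hypothetical, and the {dplyr} call only runs if that package is installed.

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter to a series of numbers;
# here it computes a moving average over 3 values.
moving_average <- stats::filter(x, rep(1 / 3, 3))
moving_average

# dplyr::filter() keeps the rows of a dataset that meet a condition.
# requireNamespace() checks that {dplyr} is installed before it is used.
if (requireNamespace("dplyr", quietly = TRUE)) {
  efficient_cars <- dplyr::filter(mtcars, mpg > 30)
  print(efficient_cars)
}
```

The two functions share a name but do very different things, which is why the conflict message deserves a careful read.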
1.2.3 Quarto
Writing code, analyzing data or producing nice plots is only one part of the workflow. You also need to communicate these results to others. This is what Quarto allows you to do. Quarto is an open source publishing system that allows you to mix text, code and the output from the code in one document. It allows you to create documents in many formats, including html, pdf, docx, … .
We will cover quarto more in depth in one of the last chapters, but it is useful to introduce it here as it will allow you to take notes during class.
You can open a new quarto document via File, New file and Quarto document. A new window opens (Figure 1.14).
You can add a title and your name. For now, you can use the html options and create the document. In the source pane, a new tab opens (Figure 1.15):
The first part (between - - -) is called the YAML (Yet another markup language). In addition to your name and the title of the document, the YAML includes information that Quarto will use to render the document (e.g. in this case, it will render as html). There are many options that you can add to the YAML, but for now, this will do. The part below the YAML is where you write your text and code. You can delete the text that you see and write your own text. You can subdivide your text using headings. If you add headings (under the normal tab), they will appear on the right of your html page.
If you include an executable cell (via the Insert tab or Ctrl-Alt-I (Command-Option-I)), you have the option to choose a programming language. In this case, you need to select R. The code block shows the programming language {r} and looks like:
a <- runif(10)
mean_a <- mean(a)
mean_a
If you edit the Quarto document in the source view (tab in the top left), the code will be included between two lines that each start with three backticks. You can save the Quarto file and give it a name (which doesn’t need to be the title in the YAML). A Quarto document has a .qmd extension. If you render the document, the code will run and the document will show all the output from the code in the Viewer tab in the Files pane.
At this point, a Quarto document is a good option to take notes in class. All notes that you take and code that you include will be saved in a Quarto file. It allows you to add notes or comments at a later stage.
To illustrate how you can use Quarto,
- start a new Quarto document, add my_first_quarto_text,
- with the exception of the YAML part, delete all text,
- write “My first quarto document shows code to produce a scatter plot for the horse power and miles per gallon variables in the mtcars dataset.”,
- press Ctrl-Alt-I/Command-Option-I; a code block opens,
- write or copy-paste the following lines in the code block in your document:
plot(hp ~ mpg,
     data = mtcars,
     main = "Horse power and fuel economy",
     xlab = "Miles per gallon",
     ylab = "Horse power")
If you hit render (Figure 1.16), Quarto asks you to save your file and shows the html version of your text in the Viewer tab in the Files pane (Figure 1.17).
Quarto is ideal to communicate your code. For instance, if you need to write a report in a team where you use R to do statistical tests, Quarto files allow you to share the code as well as the output. Doing so, other team members can review and change the code, alter visualizations, … . Note that one person you communicate with is “you” or the “future you”. You often switch between tasks: you prepare a report now, switch to a presentation tomorrow, … . A Quarto file is a good way to explain your code: what is the purpose, why did you write it the way you did, are there any to-dos left. In doing so, you document your script (what, why, how, intended outcome) as you code. If you need to switch between tasks, a well-documented workflow will make life easier as you pick up where you left off.
There are two useful options. The first: if you add #| echo: false on the line following {r}, the code will run and you’ll see the output, but Quarto will hide the code.
Type, below the previous part in my_first_quarto_text: “Here we use #| echo: false. The quarto document shows the output, but doesn’t show the code.” and copy paste the following lines in a new executable cell (make sure you delete the second {r} if that would show) and add the line #| echo: false on the line immediately below {r}. Make sure there is no empty line between {r} and #| echo: false.
Your code should look like Figure 1.18:
If you save and render the document, you’ll see that the document shows the plot but doesn’t show the code.
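As a sketch of what such a cell could contain (the figure is not reproduced here, so the exact content is an assumption), the lines below the {r} line might read as follows. Because the option line starts with #, R itself treats it as a comment; only Quarto reads it as an instruction.

```r
#| echo: false
# The line above tells Quarto to run the code and show the plot,
# but to leave the code itself out of the rendered document.
plot(hp ~ mpg,
     data = mtcars,
     main = "Horse power and fuel economy",
     xlab = "Miles per gallon",
     ylab = "Horse power")
```

You can also run these lines in a normal R script: the plot is identical, as the option line has no effect outside Quarto.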
If you don’t want to run the code (and only show the lines of the code), you can add #| eval: false in your first line.
Add the text: “If I add #| eval: false to the code, then the code is shown, but it doesn’t run.” and copy-paste the following lines in a new executable cell (again, delete the second {r} if that would show and add #| eval: false below {r} without an empty line; see also Figure 1.18):
plot(hp ~ mpg,
     data = mtcars,
     main = "Horse power and fuel economy",
     xlab = "Miles per gallon",
     ylab = "Horse power")
If you save and render the document, you’ll see the sentence that you wrote as well as the lines of code, but your document will not show the output. This is useful if you want to show code that you don’t need to run. For instance, if you work on a team project, adding this option allows you to share code that causes an error.
1.2.4 Projects
An R workflow includes importing and tidying data and using the editor to write code to analyse or visualize data or to write reports. As you step through these stages, you will need to access various files (e.g. data) and you will create different files that you want to save. If you don’t organize these activities, some files will be stored in one place on your hard drive, while other files are stored elsewhere. At some point, you’ll forget where these files are.
The way you organize your files is different from the way others do. You might save all your work in a c:/user/my_name/documents folder, others save all their work in c:/my_documents/work. If you import a file stored on your computer, you will point R to the location of that file. If you do this in a script and you use absolute references (e.g. c:/user/my_name/documents/R/work), your colleagues will have a hard time running that script. On their computer, there is no folder c:/user/my_name/documents/R/work and a command such as read.csv("c:/user/my_name/documents/R/work/data/sales.csv") will fail if they run that code.
An R project is one way to solve both these issues: it allows you to store all files in one location and you don’t need absolute references to locate a file.
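The difference between an absolute and a relative reference can be sketched as follows. The file sales.csv and the folder data are the hypothetical examples from the text, so the lines that would actually read the file are left as comments.

```r
# Absolute reference: only works on a computer with exactly these folders.
absolute_path <- "c:/user/my_name/documents/R/work/data/sales.csv"

# Relative reference: R starts looking from the working directory,
# which in an R project is the project folder. file.path() joins the
# parts of the path with a /.
relative_path <- file.path("data", "sales.csv")
relative_path

# With {here} installed, here::here() builds the path from the project root.
if (requireNamespace("here", quietly = TRUE)) {
  print(here::here("data", "sales.csv"))
}

# Reading the (hypothetical) file would then look like this:
# sales <- read.csv(relative_path)
```

A colleague who opens the same project on another computer can run the relative version without any changes.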
1.2.4.1 Creating a new project
If you start R, R will look for and store all files in the working directory on your hard drive. You can see which directory R uses as its working directory at the top of the console. That directory will also show if you run getwd():
getwd()
R will show something that looks like c:/users/my_name/documents/R. It is unlikely that this is exactly what you see, as the way you organize your files on your hard drive is different from the way others do. Note that R uses the Linux and Mac / and not the Windows \ to separate folders. R can work with both, but in R code a Windows backslash has to be written as \\.
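The two ways to write a path can be compared in a short sketch. The path is the example from the text and does not need to exist on your computer; the names windows_style, forward_style and converted are hypothetical.

```r
# In R code, a single Windows backslash is written as \\.
windows_style <- "c:\\users\\my_name\\documents\\R"
forward_style <- "c:/users/my_name/documents/R"

# Replace every backslash with a forward slash.
# fixed = TRUE tells gsub() to search for the literal character.
converted <- gsub("\\", "/", windows_style, fixed = TRUE)
identical(converted, forward_style)

# The current working directory of this session.
getwd()
```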
The working directory is where R will look for files and save files unless you point it to another directory on your hard drive. If you use R occasionally, that is not much of a problem. However, if you use R for various tasks, your working directory will be cluttered with files: data files for all projects that you did, scripts, plots, various drafts of reports and papers, … .
This is where R projects offer a very efficient solution. You can think of an R project as a storage cabinet where you store all items related to one specific project, and where every shelf collects related items. If you need an item for a specific project, you only have to open the project’s cabinet and look on its shelves to find what you were looking for. An R project is the cabinet; the shelves in that cabinet are the folders in the R project.
Like a storage cabinet, an R project stores all files in a separate cabinet (directory). Usually, a storage cabinet allows you to adjust the shelves. In an R project, you can create as many subdirectories as you need. In other words, R projects allow you to collect all the files in a separate directory. This directory is called the project root directory (the cabinet). Within that directory, you can add as many subdirectories (or shelves) as you need. If you need to work on a specific project, all you need to do is open the associated R project. In doing so, R will use the project root directory as the working directory.
Technically, you could also set the work directory using the set working directory commend setwd().
setwd("c:/users/documents/R/my_new_project")For multiple reasons, this is not the most efficient thing to do. First, everytime you need to work on my_new_project you will need to run that command. Suppose that you have a project with multiple R script files. If you forget to add the command in a file and that file includes a line that saves a plot, that plot will be saved in a different directory. You would then have to run getwd() to find where that plot was saved. Second, if you collaborate on a project and you set the working directory using the setwd() command in every script, others in your team will need to change that line in your script. It is very unlikely that your working directory, c:/users/documents/R/my_new_project, will exist on their computer. Using a R project also avoids that problem.
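To see how the working directory decides where a relative path ends up, the following sketch writes a small file using only a file name. The directory my_new_project inside tempdir() and the file name wd_demo_file.csv are made up for this illustration, and the script restores the original working directory at the end.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# Hypothetical project directory inside R's temporary folder
project_dir <- file.path(tempdir(), "my_new_project")
dir.create(project_dir, showWarnings = FALSE)

# Point R to the project directory and save a file using a relative path
setwd(project_dir)
writeLines("x,y", "wd_demo_file.csv")

# The file was written inside the working directory
saved_in_project <- file.exists(file.path(project_dir, "wd_demo_file.csv"))

# Back in the old working directory, the same relative path finds nothing
setwd(old_wd)
found_from_old_wd <- file.exists("wd_demo_file.csv")

print(saved_in_project)
print(found_from_old_wd)
```

The same file name points to two different places depending on the working directory, which is why a script that relies on setwd() is fragile.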
To create a new project, follow File, New Project and RStudio will open a new window (Figure 1.19):
If you continue with New Directory and New Project, you can give the project a name (new directory in the Create New Project window) and select where you want to save the project on your hard drive (browse if you want to change). If you then click on Create Project, your project will be created. If you ask for the working directory with getwd(), you will see that the working directory is equal to the project directory. RStudio also adds a new_project_name.Rproj file to that directory. This directory is the project’s root directory. If you click on that file, the project options window opens (Figure 1.20). Here, you need to make a couple of changes:
The first option prevents R from reloading all the data in your project (i.e. restoring all variables in the environment from your previous session) when you open it. There are two reasons to do so. First, for projects with large datasets, restoring the environment substantially slows down loading R. Second, all variables should be created using code you save in scripts. If R restores your environment, you may not notice that you never wrote and saved the code that creates a variable. This is a recipe for trouble. Suppose that you want to re-use your code in another project. Variables that you created in the old project but whose code you forgot to save will be missing, and you’ll have to rewrite parts of your code to create them.
The last option saves your history.
If you need to change projects, you can do so via File and Open Project or Recent Project.
In the files pane you can now add new directories to organize your project’s files. An example of a project structure is shown in Figure 1.21.
Although there are many possibilities to organize a project, there are a couple of general points that you can make:
Via New File and Text File (in the files pane), add a readme.txt file to the project. You can use this file to share information on the project, the various directories, major data sources, … .
As a general rule: never change your raw data. Store them in a separate (read-only) file or in a separate directory. If you change your raw data files, you will not be able to rerun your analysis and nobody will be able to replicate what you did. To avoid overwriting a raw data file with an edited file, you can store your raw data in a raw subdirectory of your project’s data directory. As you clean and tidy data, save these files in a tidy or clean subdirectory of your project’s data directory. In doing so, you will always keep your original raw data files as you had them at the start of the project.
Most projects include multiple scripts. In the example in Figure 1.21, you can see these scripts in the scripts subdirectory. You can organize your scripts in the order in which they have to run (as in Figure 1.21), you can add a date and organize them in the order in which they were created or changed, … . For very large projects, you can include more subdirectories: e.g. one for all your tidying and cleaning scripts, one for the analysis scripts, … .
If you need to write a report, you can add a reports subdirectory. There you can save e.g. the plots or tables that you need to communicate, Quarto files that you need to share, or PowerPoint presentations that you create within R.
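The folders described in these points can also be created with code. The following is a minimal sketch, assuming a throw-away project root inside tempdir(); the name my_new_project and the readme text are made up for the illustration.

```r
# Hypothetical project root inside R's temporary folder
project_root <- file.path(tempdir(), "my_new_project")

# The folders from the points above, written relative to the project root
folders <- c("data/raw", "data/tidy", "scripts", "reports")

for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# A short readme that describes the project
writeLines(
  c("Project: my_new_project",
    "data/raw: original data files, never edited",
    "data/tidy: cleaned data files",
    "scripts: R scripts, numbered in the order in which they run",
    "reports: plots, tables and Quarto files"),
  file.path(project_root, "readme.txt")
)

# List the directories that were created
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Inside an R project you would leave out project_root and create the folders relative to the project’s root directory, or simply use the New Folder button in the files pane.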
For this class, you should create a specific project. That way, all your files will be stored in the same location. You can add the folders as shown in Figure 1.21: data (with raw and tidy subfolders) and scripts. If you use Quarto for your lecture notes, you can add a folder lecture_notes and organize your Quarto files using a number (01_, 02_, …) or a date and a description (e.g. lecture_intro, lecture_import, …). You can organize your scripts in the scripts folder in a similar way. I will assume that this project structure exists for your exam.
1.2.4.2 The {here} package
You installed the {here} package. This package makes it easy to find and save files in a project. Say you want to open a dataset called sales.csv which is in your project’s data directory. Here is a line of code that loads that file:
read.csv("data/sales.csv")
Using the {here} package, you can rewrite that code as
read.csv(here::here("data", "sales.csv"))
The {here} package automatically builds your file’s path relative to your project’s directory. The command here::here("data", "sales.csv") builds a path starting from your project’s root directory (the one you created in Figure 1.19). This is especially useful if you have multiple subdirectories in your project. For instance, suppose that you store your sales data for a specific product in a file product.csv in a directory which is organized per year and per region, e.g. .../data/2024/sales/nordics/. Using {here}, you can type
read.csv(here::here("data", "2024", "sales", "nordics", "product.csv"))
The {here} package will build the path as if you had written .../data/2024/sales/nordics/product.csv, where ... stands for your project’s root directory.
Because you are in an R project, you always start from the project’s root directory. In other words, with {here} you are building paths to files relative to that directory. In that way you can share projects with colleagues. As long as their project directory structure is equal to yours, it will be irrelevant where they have created their project on their hard drive. Without a project’s ability to use relative paths, that would be impossible. If your files are in c:/users/documents/R/my_new_project and theirs are in c:/R/projects/marketing/my_new_project you would load your sales data using:
read.csv("c:/users/documents/R/my_new_project/data/sales.csv")
As your colleagues don’t have these folders, running your script on their computer would cause an error. To read the file, they would have to change c:/users/documents/R/my_new_project/data/sales.csv to c:/R/projects/marketing/my_new_project/data/sales.csv. If your code includes a command to save a file, that line would cause a similar error, as the location where you save your files is different from theirs. With R projects, you avoid these errors. Because R now looks for files relative to the project’s root folder, on your computer it starts from c:/users/documents/R/my_new_project. If your colleagues use R projects as well, R starts from their project’s root folder, c:/R/projects/marketing/my_new_project, on their computer.
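The principle can be sketched in base R with file.path(), which joins path components with /. This is only an illustration of the idea, not of how {here} works internally: here::here() finds the project root for you, whereas the sketch hard-codes the two roots from the text in the made-up variables my_root and their_root.

```r
# The two project locations from the text
my_root    <- "c:/users/documents/R/my_new_project"
their_root <- "c:/R/projects/marketing/my_new_project"

# The part of the path that both computers share:
# the location of the file relative to the project root
relative_path <- file.path("data", "sales.csv")

# Each computer completes the path with its own project root
my_file    <- file.path(my_root, relative_path)
their_file <- file.path(their_root, relative_path)

print(my_file)
print(their_file)
```

The script only has to contain the relative part; the root differs per computer and is supplied by the project.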
1.2.4.3 Scripts
If you take another look at Figure 1.21, there are 3 directories. We will cover importing and exporting data files that you would find in and save to your data folders in Chapter 6. Reports and plots are covered in e.g. Chapter 9. Here we will focus on the use of scripts in a project.
Recall that the editor in the source pane allows you to write code and that you can save that code in a .R file (a script). You can reuse that script every time you need to run the same analysis. Suppose that you have to analyze and present sales data every month. A script would allow you to write all the code to import and tidy your sales data, run the analysis and create and save the plots and tables you need to build your PowerPoint presentation. An R project can include multiple scripts. For shorter projects, you can usually include all your code in one script. However, if your code is very long, splitting it up into several scripts is usually the best option. It allows you to collect all code relevant to one part of your work in one script. That way, you don’t have to scan through one long script to find a specific part. In addition, it helps you to debug your code: if you see an error, it is easier to locate exactly which part of your code (i.e. which script) caused that error.
You can start a script via File, New File and R Script or via Ctrl-Shift-N (Command-Shift-N on a Mac). In the source pane, you will now see a new tab Untitled1. The editor also shows the line numbers. As you haven’t written any code, that line number for now is 1.
The editor helps you to write code. For instance, (slowly) type for in the editor. RStudio’s editor then shows the suggestions in Figure 1.22.
RStudio’s editor suggests a couple of “words” that include the word “for”. If you hit tab on the word for, RStudio shows you the for loop template. A for loop requires you to include
the word for
between brackets ( ): a variable, the word in and a vector
between curly brackets { }: your code, indented with 2 spaces
The template also shows where the curly brackets go: the opening one at the end of the for line and the closing one on the last line. If you need a for loop in your code and hit tab when RStudio suggests for, you will have the basic template ready. To illustrate, let’s fill in the parts:
i <- 1
for (i in 1:5) {
  print("Hello world")
}
Here the ‘variable’ in the template is i and the vector is (1, 2, 3, 4, 5). The code that R runs is between the curly brackets. As we’ll see in Chapter 13, a for loop runs that code once for every element of the vector: before each pass, R assigns the next element to i, so the code above prints “Hello world” five times. The first line, i <- 1, is therefore not needed: the loop sets i itself and overwrites any value i had before. To see this, run the code once. In the environment pane, you’ll see that the value of i is equal to 5, the last element of the vector. Now delete i <- 1 and run the code again. You’ll see exactly the same result: R starts again from the first element of the vector, whatever the value of i was.
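A small variation, offered here only as an illustration, makes the role of the loop variable visible by printing i instead of a fixed text. The name squares is made up for this example.

```r
# Empty numeric vector to collect the results
squares <- numeric(0)

for (i in 1:5) {
  # In each pass, i holds the next element of the vector 1:5
  print(i)
  squares <- c(squares, i^2)
}

print(squares)
```

Each pass through the loop uses a different value of i, so squares ends up with one element per element of 1:5.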
You’ll see these suggestions often. For instance, write the word mean in the editor. RStudio will now show additional information for the function mean() (see Figure 1.23).
RStudio’s editor also helps with brackets. In the editor, type one opening bracket (. RStudio’s editor adds the closing bracket ). The same happens if you use a curly bracket {, a square bracket [ or a quotation mark ". In addition, it automatically indents code in e.g. a for loop. RStudio also makes suggestions for functions. For instance, type
vec1 <- (1:5)
mean(vec1, na.rm = TRUE)
and see how many times RStudio makes a suggestion.
RStudio will alert you to problems or errors with a red squiggly line and a red circle with a white cross in the sidebar. If you hover over the cross, RStudio will show why the error occurs. Via Tools, Global Options, Code and the Diagnostics tab (Figure 1.24) you can select when RStudio will issue a warning.
A last observation: you can see that RStudio adds color to various parts of the code. For instance, the words for and in are shown in blue, the word TRUE is shown in red and the 2 subsequent brackets are in a different color. You can change the look of the editor via Tools, Global Options and then Appearance. The appearance is personal and has no impact on the way in which RStudio runs.
Notice that the tab Untitled1 is red. That means that you haven’t saved your code yet or that you have unsaved changes to your code. You can save the script as my_first_script and save it as an .R file in your project’s scripts directory.
1.2.4.4 File and variable names
With respect to the names of your scripts or of the variables in your scripts, there are few hard rules. However, note that R is case sensitive. In other words, var1 and Var1 are two different variables. In addition, avoid spaces and special characters such as $, €, %, &, … as some of these characters could cause problems in some applications, packages or operating systems.
To name files or variables, there are four popular case types:
Camel case (or lowerCamel case): you start a name with a lower-case letter. If a name includes multiple words, you use a capital letter for the second, third, … word. For example: scriptToLoadLibrary, salesRawData
Pascal case (or UpperCamel case): is similar to Camel case, but also uses a capital letter for the first word. For example: ScriptToLoadLibrary, SalesRawData
Snake case: all letters are lower case and words are separated with an underscore. For example: script_to_load_library, sales_raw_data
Kebab case: is similar to Snake case but uses a hyphen (-) instead of an underscore (_). For example: script-to-load-library, sales-raw-data. In R, use Kebab case for file names only: in a variable name, R reads the hyphen as a minus sign.
The most important rule when naming files and variables is to be consistent and to think of a name that is both human readable and machine readable. A file or variable name is human readable if the name suggests the content of the file or variable. For instance, a file name 2024_sales_sweden_chocolates.csv is human readable as long as the data in the file refer to sales of chocolates in Sweden for the year 2024. Note that what a human can read depends on the circumstances. If you work for a firm that sells chocolates, a short product code rather than a long product name could be very informative for those who work for that company. For others, it is highly unlikely that a product code is sufficient to decipher what is in a file.
With respect to variables, names such as total_sales_2024 or averageCost are informative about the values that these variables hold (assuming, obviously, that the first holds the sum of sales for 2024 and the second the average cost of something). However, try to limit the length of your variable names: tot_sales_24 or aveCost reduce the number of keystrokes when you write code.
If you need to import various files, it is often very convenient if you can do so in one or a couple of lines of code. If your file names are machine readable, you will be able to do so. Suppose that your data directory includes the following data files
2024_sales_sweden_chocolates.csv
2024_sales_norway_chocolates.csv
2024_sales_finland_chocolates.csv
2024_sales_denmark_chocolates.csv
2024_sales_sweden_cookies.csv
2024_sales_norway_cookies.csv
2024_sales_finland_cookies.csv
2024_sales_denmark_cookies.csv
2023_sales_sweden_chocolates.csv
2023_sales_norway_chocolates.csv
2023_sales_finland_chocolates.csv
2023_sales_denmark_chocolates.csv
2023_sales_sweden_cookies.csv
2023_sales_norway_cookies.csv
2023_sales_finland_cookies.csv
2023_sales_denmark_cookies.csv
If you need only the data for Denmark, you can write code that selects only those files whose name includes denmark. Likewise, if you need only the files whose name includes cookies and 2024, you will be able to write code that imports only those files. The same holds for variable names. For instance, if you need to select all the columns in a dataset that include the name of a specific product, it will help if these variables include the name of that product in a consistent way or if these names all start or end with e.g. prod. Doing so, you’ll be able to select these variables with a couple of words of code.
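A minimal sketch of such a selection in base R is shown below. It works on a character vector with a few of the file names above rather than on real files; the object names files, denmark_files, cookies_2024 and file_info are made up for the illustration.

```r
# A few of the file names listed above, stored as a character vector
files <- c("2024_sales_sweden_chocolates.csv",
           "2024_sales_denmark_chocolates.csv",
           "2024_sales_denmark_cookies.csv",
           "2023_sales_denmark_cookies.csv",
           "2023_sales_norway_cookies.csv")

# Keep only the files for Denmark
denmark_files <- files[grepl("denmark", files, fixed = TRUE)]

# Keep only the 2024 files on cookies
is_2024    <- grepl("^2024_", files)
is_cookies <- grepl("cookies", files, fixed = TRUE)
cookies_2024 <- files[is_2024 & is_cookies]

# Because the parts are separated by "_", the names can be split into columns
parts <- strsplit(sub(".csv", "", files, fixed = TRUE), "_", fixed = TRUE)
file_info <- as.data.frame(do.call(rbind, parts))
names(file_info) <- c("year", "type", "country", "product")

print(denmark_files)
print(cookies_2024)
print(file_info)
```

With real files, list.files("data", pattern = "denmark") would return the matching file names in a data directory in the same way.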
Third, by default, files are ordered alphabetically in your directory. A file such as import_data.R will be shown after generating_plots.R. Usually, you import data before you generate a plot. It could be convenient if the first file you see imports the data and the second generates the plot. If you work with multiple R scripts, you can order them by adding a number at the start of the name: 01_import_data.R and 02_generating_plots.R. For plots or tables, you can order them as they appear in the text: t01_name_of_table or f01_name_of_figure. For reports, you can add a date before their name, e.g. 20240215_report or 2024Q1_report. If you use numbers, think about the total number of files you will need. For instance, if you think you’ll need 14 files, start the first number as 01 and not as 1; if your report includes 15 figures, use figure_01 for your first figure, … . If you do so, your files will be shown in the order in which you use them or in which they appear in the report. If you have 15 figures and start with figure_1, you’ll see figure_1 and then figure_10 … figure_15 before you see figure_2.
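The effect of the leading zero can be checked with a short sketch. It sorts file names the way a directory listing typically would, character by character; the vectors unpadded and padded are made up for the example, and sprintf() with the format %02d is used to add the leading zero.

```r
# File names without leading zeros
unpadded <- paste0("figure_", 1:15)

# method = "radix" compares character by character, independent of the locale
sorted_unpadded <- sort(unpadded, method = "radix")
print(head(sorted_unpadded, 8))

# sprintf() with %02d pads the number with a leading zero to two digits
padded <- sprintf("figure_%02d", 1:15)
sorted_padded <- sort(padded, method = "radix")
print(head(sorted_padded, 3))
```

Only the padded names keep their numerical order after sorting.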
With respect to variable names avoid names that are equal to names of functions (e.g. mean, summary, … ) or reserved words such as if, for, while, else, … .
Mixing cases is usually not recommended. One exception is functions. Some people prefer to distinguish the functions they write from their variables or files. They would then use e.g. snake_case for their variables but myFunction() or MyFunction() (in other words, camelCase or PascalCase) for their functions. As long as you are consistent, this is a matter of taste.
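If you are not sure whether R accepts a name for a variable, base R’s make.names() offers a quick check: it rewrites any string into a syntactically valid name, so a candidate that comes back unchanged was already valid. The candidate names below are made up for the illustration.

```r
# Candidate variable names: some valid, some not
candidates <- c("sales_raw_data", "salesRawData", "sales-raw-data",
                "total sales", "2024_sales", "if")

# A candidate is valid if make.names() does not need to change it
is_valid <- make.names(candidates) == candidates

print(data.frame(candidates, is_valid))
```

Note that this check only covers syntax. Names of existing functions such as mean pass it, so it does not replace the advice above.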
1.2.5 The assignment operator <-
Run the next line
var1 <- 25
You should read this statement as “the object var1 is assigned the value of 25” or “the object var1 gets the value of 25”. In this case, the object is a numerical variable, but it could be any other object in R: a vector, matrix, data frame, a plot or a table. Once you have assigned a value to an object, you can use the object in your code. For instance, if you multiply var1 by 4
var1 * 4
[1] 100
the result is 100. So, as long as you don’t change the value of var1, its value will be 25.
The assignment operator <- (a less-than sign and a minus sign, which together look like an arrow pointing left) is widely used in R to assign a value to an object. Recall for instance that you wrote df <- mtcars. This statement assigns the dataset mtcars to an object that we call df. In doing so, the object df is a copy of the dataset mtcars. If you assign a value to an object, you’ll see that object in the environment pane. If you ran the code var1 <- 25, your environment pane showed var1 and a value of 25 in the Values section. If you ran df <- mtcars, you saw the object df in the Data section of the environment pane.
Technically, = would do the same trick. For instance,
var2 = 125
will create a variable whose value is 125 (see its value in the environment pane). If you do math
var2 / 25
[1] 5
you’ll see that the result is equal to 125/25.
As a rule, if you want to assign a value to an object, always use the assignment operator. We will use = often but almost always within a function. To see the importance, let’s create a vector with 10 (n = 10) random draws from a normal distribution (rnorm(): r for random, norm for normal distribution) with mean 2 (mean = 2) and standard deviation 4 (sd = 4):
vec1 <- rnorm(n = 10, mean = 2, sd = 4)
If you run this line, you’ll see that vec1 shows up in the environment. Now run the next line
vec2 <- rnorm(n = 10, mean <- 2, sd <- 4)
Note that the environment pane now shows vec2 as well as mean and sd. These are now variables with a value: 2 for mean and 4 for sd. That is not what we want. We don’t want to create a variable mean with the value 2; we want random draws from a normal distribution with mean 2. In other words, R does not need to remember that mean was set to 2 once the numbers are drawn. Within a function call, = only passes the value to the argument mean: the value 2 exists inside the function and disappears as soon as the function is complete.
We met a similar case when we wrote the code to draw the horse power - miles per gallon plot:
plot(mpg ~ hp,
     data = mtcars,
     main = "Horse power and fuel economy",
     xlab = "Horse power",
     ylab = "Miles per gallon")
In this code, we didn’t use <- in the data = mtcars part. Again, we don’t want R to remember that data was set equal to mtcars to produce the plot. If we had used the assignment operator, the mtcars dataset would have been assigned to the object data and would have shown up in the environment pane. In other words, that variable would continue to live outside of the function call.
As a result, almost all people who use R use the assignment operator <- to assign a value to an object and use = inside function calls. That way, you avoid creating a variable such as mean which continues to live in your environment beyond the function call. With the assignment operator, you are explicit about your goal: you want to assign a value to an object (a vector to an object vec1, a dataset to an object df which we will call a data frame, …).
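The difference can be checked with exists(), which reports whether a variable with a given name is present. The following is a minimal sketch; the names sd_exists_after_equals and sd_exists_after_arrow are made up, and inherits = FALSE restricts the search to your own environment so that the built-in function sd() is not counted.

```r
# Start clean: remove variables called mean or sd if they exist
rm(list = intersect(c("mean", "sd"), ls()))

# Arguments passed with = stay inside the function call
vec1 <- rnorm(n = 10, mean = 2, sd = 4)
sd_exists_after_equals <- exists("sd", inherits = FALSE)

# With <-, R first creates the variables mean and sd in your environment
vec2 <- rnorm(n = 10, mean <- 2, sd <- 4)
sd_exists_after_arrow <- exists("sd", inherits = FALSE)

print(sd_exists_after_equals)
print(sd_exists_after_arrow)

# Clean up so that mean and sd only refer to the functions again
rm(mean, sd)
```

Both calls draw 10 random numbers, but only the second one leaves extra variables behind.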
You can also use ->:
3 -> var3
However, in most cases you would use <- as it aligns with the way you read your code: first the object’s name, followed by the value.
Let’s create two variables: var1, with the value 125, and var2, two times the value of var1:
var1 <- 125
var2 <- 2 * var1
As you would expect, var1 is 125 and var2 equals 250. Now change the value of var1 from 125 to 25:
var1 <- 25
and check the value of var2:
var2
[1] 250
As you can see, the value of var2 was 250 before we changed the value of var1 and is still 250 after we changed var1’s value. This shows that the value assigned to a variable or object equals the value at the time the expression was evaluated. Here, the expression var2 <- 2 * var1 was evaluated with var1 equal to 125. Changing the value of var1 after this evaluation does not affect the value of var2. In other words, once created, var2 forgets where it came from. It does not remember that it was the result of the expression 2 * var1 but only remembers its value when this expression was evaluated.
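A practical consequence, sketched below with the same two variables, is that you have to evaluate the expression again if you want var2 to reflect the new value of var1.

```r
var1 <- 125
var2 <- 2 * var1

# Changing var1 leaves var2 untouched
var1 <- 25
print(var2)

# To bring var2 in line with the new value, evaluate the expression again
var2 <- 2 * var1
print(var2)
```

This is one reason to keep all code in a script: rerunning the script from top to bottom recreates every variable from its current inputs.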
1.2.6 Other conventions and style guide
1.2.6.1 Case sensitivity and decimals
I already mentioned the fact that R is case sensitive. In other words, my_var1 and My_var1 are two different objects.
my_var1 <- 100
My_var1 <- 1000
2 * my_var1
[1] 200
2 * My_var1
[1] 2000
If you write this code in the editor, RStudio suggests auto completions after you have typed the first 3 letters, my_ and will show both my_var1 and My_var1. If you use both, this will almost surely cause mistakes. You can avoid these mistakes if you are consistent in your use of capitals for your variable names.
Second, R uses a point for decimals: 10.25 and not 10,25.
a <- 10.25
If you use a , you will see an error:
b <- 10,25
Error in parse(text = input): <text>:1:8: unexpected ','
1: b <- 10,
          ^
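Inside a function call, a comma does not cause an error but it still is not a decimal sign: it separates two values. The sketch below, with made-up object names, shows the difference and one possible way to convert a number that arrives as text with a decimal comma.

```r
# A point is the decimal separator: one number
a <- 10.25

# Inside c(), the comma separates two values: a vector of length 2
b <- c(10, 25)

print(length(a))
print(length(b))

# Text with a decimal comma can be converted by replacing the comma first
text_value <- "10,25"
converted <- as.numeric(sub(",", ".", text_value, fixed = TRUE))
print(converted)
```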
1.2.6.2 Style guide
You’ll write code, but you’ll often need to read code as well. To help people, including your future self, with the last activity, there are a couple of conventions (i.e. a style guide). In English or in any other language, you are familiar with the fact that a sentence starts with a capital and ends with a dot, a question or exclamation mark; that you add a white space after a comma or that a new paragraph always starts on a new line. The R style guide has similar rules.
In R, like in any other language, you will use commas. As in most other languages, you use a space after a comma, but never before. For instance, you would write
x[, 1]
but not
x[,1]
x[ ,1]
x[ , 1]
In function calls, you don’t add a space before or after the parentheses. For instance, you write
sd(x, na.rm = TRUE)
but not
sd (x, na.rm = TRUE)
sd( x, na.rm = TRUE )
sd ( x, na.rm = TRUE )
This is not always the case. If you use e.g. for, while or if, you add a space before the opening parenthesis ( and after the closing parenthesis ):
for (i in 1:10) {
  do_something_with(i)
}
if (condition) {
  do_something_with(x)
}
while (i < 100) {
  do_something_with(i)
}
We’ll also write our own functions. In a function definition, you add a space after the closing parenthesis ), before the opening curly brace {:
function(a, b) { }
You wouldn’t write
function(a, b){ }
function (a, b){ }
function (a, b) { }
For operators such as +, -, /, =, |>, %>% or *, you usually put a space before and after the operator:
x <- 4
mean(x, na.rm = TRUE)
a == b
2 + 4
6 - 8
24 / 6
1 * 4
x |> y
z %>% q
There are a couple of exceptions.
2^2
1:10
here::here
~x
!=
Curly braces { } in code blocks show a hierarchy in R. To show this hierarchy more explicitly, there are a couple of conventions:
after e.g. if and the condition, or for and the loop, { should be the last character on the line
the next lines should be indented with 2 spaces
the closing } should be the first character on the line
if (condition) {
  do_something
} else {
  do_something_else
}
This is especially useful if you have more than one of these conditions:
for (i in x) {
  if (condition) {
    do_something
  } else {
    if (another_condition) {
      do_something
    } else {
      do_something_else
    }
    do_something_else
  }
}
Here you immediately see the hierarchy. Every condition is within curly braces and indented with 2 spaces.
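The same layout in a runnable form could look as follows. The function label_values() and its thresholds low and high are hypothetical and only serve to show the spacing around function, for and if and the 2-space indentation of every nested block.

```r
# Hypothetical helper: label each value as "low", "medium" or "high"
label_values <- function(values, low = 10, high = 20) {
  labels <- character(length(values))
  for (i in seq_along(values)) {
    if (values[i] < low) {
      labels[i] <- "low"
    } else {
      if (values[i] < high) {
        labels[i] <- "medium"
      } else {
        labels[i] <- "high"
      }
    }
  }
  labels
}

print(label_values(c(5, 15, 25)))
```

Because every level is indented by 2 more spaces, you can see at a glance that the second if belongs to the else branch of the first one.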
Sometimes function calls are long. This is usually the case if you have many named arguments (e.g. mean = 5 in the function rnorm()). In general, it is a good idea to keep a single line short and to spread a long call over multiple lines. For instance, you could use
print(paste0("The mean of the vector is equal to ",
             round(mean(1:250),
                   digits = 4)))
instead of
print(paste0("The mean of the vector is equal to ", round(mean(1:250), digits = 4)))
The same holds for code using the pipe operator. To avoid lengthy lines, it is often a good idea to start a new line after every |>
df |>
  filter(y == 25) |>
  ggplot(aes(x = something, y = something)) +
  geom_line()
Sometimes an exception is made for the first line, because this line usually starts with the dataset you want to use. You can add your first operation and have two pipes on the same line:
df |> filter(y == 25) |>
  ggplot(aes(x = something, y = something)) +
  geom_line()
The same holds for a line that starts with an assignment
plot_df <- df |> filter(y == 25) |>
  ggplot(aes(x = something, y = something)) +
  geom_line()
However, try to keep the first line as short as possible (i.e. don’t add too many pipes).
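The examples above use placeholders such as df and something. A version that runs as written, using only base R functions and the built-in mtcars dataset, could look like the sketch below. It assumes R 4.1 or later, which introduced the native pipe |>; the name four_cyl_mpg is made up for the example.

```r
# One step per line: start from the data, then filter, summarise and round
four_cyl_mpg <- mtcars |>
  subset(cyl == 4) |>
  with(mean(mpg)) |>
  round(digits = 1)

print(four_cyl_mpg)
```

Each line adds exactly one operation, so the code reads from top to bottom as a list of steps.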